
Conversation

@superdosh (Contributor):

Initial implementation of a workflow that someone developing a new evaluator could use, with experiment tracking managed in MLflow.

I tried to explain what's happening in the README.md, and the best way to see it is via the template Jupyter notebook.

Opening as draft for feedback!

superdosh self-assigned this May 22, 2025
github-actions bot commented May 22, 2025:

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

superdosh force-pushed the initial-implementation branch from 2700fe9 to d40f200 on May 22, 2025 14:42
superdosh force-pushed the initial-implementation branch from d40f200 to 80a1cb0 on May 22, 2025 14:44
@superdosh (Contributor, Author):

Apologies for the noise; I had the wrong email in my git config, so the CLA check failed. Fixed now.

superdosh marked this pull request as ready for review May 27, 2025 19:04
superdosh requested a review from a team as a code owner May 27, 2025 19:04
@bkorycki (Contributor) left a comment:

This looks great! I haven't tried playing with it yet, but the README makes it seem simple to use, which is awesome. I also think that in the future it would be very valuable to be able to run non-local DVC datasets.
One thing I think we should maybe think about now is standardizing the experiment IDs. This would enable users to more easily dig through a large list of past runs. For example, the ID could be automatically constructed from the SUT IDs/annotator IDs/dataset name, and an optional tag. @bollacker may have more thoughts here.
Other than that I don't have any major notes! Looks like a solid first implementation to me. :)
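
To make the suggested scheme concrete, here is a minimal sketch of how such an ID could be assembled. The helper name and the example values are hypothetical, not part of this PR:

# Hypothetical helper illustrating the ID scheme suggested above.
def build_experiment_id(sut_ids, annotator_ids, dataset_name, tag=None):
    # Sort for stable IDs regardless of the order components are passed in.
    parts = ["-".join(sorted(sut_ids)), "-".join(sorted(annotator_ids)), dataset_name]
    if tag is not None:
        parts.append(tag)
    return "_".join(parts)

# build_experiment_id(["demo_sut"], ["demo_annotator"], "demo_dataset", tag="v1")
# -> "demo_sut_demo_annotator_demo_dataset_v1"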

help="The number of jobs to run in parallel. Defaults to 1.",
)
@load_from_dotenv
def get_responses(
Contributor:

This is a nit-pick, but can you rename the response/responder stuff to be something more specific to SUTs? "Response" isn't a term unique to SUTs, imho.

@superdosh (Contributor, Author):

Definitely! Is there a term we use elsewhere that would make sense to re-use here? Or it could just be get_sut_responses?

Contributor:

I think get_sut_responses is good :)
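
For reference, a minimal sketch of what the rename could look like, assuming the click-style setup visible in the snippet above; the option name and body are illustrative, and only load_from_dotenv and the help text come from this PR:

import click

@click.command()
@click.option(
    "--jobs",  # option name assumed for illustration
    default=1,
    help="The number of jobs to run in parallel. Defaults to 1.",
)
@load_from_dotenv  # decorator from this PR
def get_sut_responses(jobs):  # renamed from get_responses per this thread
    ...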

@superdosh (Contributor, Author) commented May 27, 2025:

> One thing I think we should maybe think about now is standardizing the experiment IDs. This would enable users to more easily dig through a large list of past runs. For example, the ID could be automatically constructed from the SUT IDs/annotator IDs/dataset name, and an optional tag.

@bkorycki, I like this idea, though I worry that if we use all of those things to construct the experiment name, it'll be too long. If we're good about the tagging, that should enable the nice searching on its own. However, we're currently missing the sut_id tag on the annotator run; I'll add that.

I'll add a note on this to the README under the TODOs!
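
As a sketch of the tag-based searching described above (standard MLflow fluent APIs; the tag key and value here are assumptions from this thread, not code from the PR):

import mlflow

# During the annotator run, tag it with the SUT whose responses are being
# annotated (the missing sut_id tag mentioned above; the value is illustrative).
with mlflow.start_run():
    mlflow.set_tag("sut_id", "demo_sut")

# Later, past runs can be dug up by tag instead of by experiment name.
runs = mlflow.search_runs(
    search_all_experiments=True,
    filter_string="tags.sut_id = 'demo_sut'",
)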

superdosh merged commit 576e5cc into main May 28, 2025 (1 check passed)
github-actions bot locked and limited conversation to collaborators May 28, 2025
@superdosh (Contributor, Author):

@bollacker, merging so we can branch off of main for the next steps, but please still comment if you like!

superdosh deleted the initial-implementation branch May 30, 2025 19:21